========================================================
In this analysis we explore dataon real estate transactions in 2016 over a five day period. The key goal of this study was to determine what were the main factors in the dataset which drove the price of the real estate transactions.
SpatialKey is a private company which buids geospatial insturance technology that simplifies how insureers view risk by building catastrophe data, mapping, and analytics. As part of their training resources, Spatialkey provides a plethora of datasets with geospatial information, which can be downloaded from this url: https://support.spatialkey.com/spatialkey-sample-csv-data/
This.csv file is the Sacramento real estate transcations (SRET) file provided by SpatialKey. The file is a list of 985 real estate transactions in the Sacramento county reported over a five-day period in 2016, as reported by the Sacramento Bee.
The Sacramento Bee is the largest newspaper in Sacramento, the fifth largest newspaper in California, and the 27th largest paper in the USA.
Sacramento County is a county of the U.S State of California, and according to the 2010 census the population was 1,418,788. Also it contains the state capital of California, Sacramento, which has an estimated popualtion of 493,025; making it the 6th largest city in California.
The SRET dataset was chosen as it provides a rich range of datatypes, in a tidy format, with the option for practicing multiple editing, analytical and visual techniques.
The datset itself contains the following variables:
street: The physical address of the property. Categorical variable, and it will be dropped due to redundancy.
city: The city of the property within the Sacramento county. Categorical variable.
zip: The USPS Zone Improvement Plan used in postal addresses in the USA. Categorical variable organized in a numeric format.
state: The state in which the poperty is located. Categorical variable, which will be dropped as it does not provide useful information. All proerties are in the state of California
beds: The number of beds the property has. Quantitative variable
baths: The number of bathrooms the property has. Quantitative variable
sq__ft: The square feet the property has. Quantitative variable which will be converted to square meters
type: The type of property. Categorical variable
sale_date: The day the property was sold. timedate variable which will be parsed into day
price: The price at which the propery was sold. Quantitative variable
latitude: The geospatial location based on latitude of the property. Geospatial data
longnitude: The geospatial location based on longnitute of the property. Geospatial data
Please note, all referenced websites were viewed on April 18th 2018
First I review the dataset dimensions, variables and structure
The above dataset has a few variables which I want to drop or change. Particularly I want to remove street, stat and sale_datee, and convert sq__ft to square meters.
df <- subset(df, select = -c(street, state))# drops street and state variable
df['sq_m'] <- df$sq__ft*0.093# add a new column for data which is square meters
#test <-strsplit(as.character(df$sale_date), 'May') #parse data
#df['sale_day'] <-do.call(rbind, test) #add the day to the datafram
#df <- subset(df, select = -c(sale_date)) #drop the datetime
df <- subset(df, select = -c(sq__ft)) #drop the square feet
I review the dataset after the transformation.
As a quick overview of the cities included in the analysis, I perform a quick, aggregate of the variable city using a table, and then by visualizing the data in a bar and pie chart
The pie chart code is derived from the following url http://www.sthda.com.
What stands out form this initial analysis is that there are 3 cities with an extraordinarly higher number of transactions relative to the other cities. These three cities are Sacramento, Elk Grove and Lincoln.
From this point onwards I perform similar exploratory analysis and visualizations on the remainding variables.
From initial inspection it does not seem that the variable type will provide much insight into the data.
summary(df$beds)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 2.000 3.000 2.912 4.000 8.000
ggplot(df, aes(x=beds, fill=beds)) +
geom_bar()
There is a condierable number of missing data in beds, which can be later removed. Also, This data seems like it would provide useful information for understanding the transactional patterns in the data. Also at present this variable is an integer which should be converted to a factor variable to really appreciate its usefulness.
summary(df$baths)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 1.000 2.000 1.777 2.000 5.000
ggplot(df, aes(x=baths, fill=baths)) +
geom_bar()
The variable baths has similar properties as the variable beds.
summary(df$price)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1551 145000 213800 234100 300000 884800
ggplot(aes(x = price), data = df) +
geom_histogram() +
scale_x_continuous()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
This is one of the most important variable sin this dataset as we are studying property sales. Visually we note a high number of missing values and that the data is skewed to the right due to high sales numbers.
This variable shares similar proerties like the varibale price
With this initial univariate analysis, there are interesting findings to report:
There is a substantial amount of property located in Sacramento (n=434), followed by Elk Grove (n=114) and Lincoln (n=72).
Most properties are reported as Residential.
Most properties have 3 bedrooms and 2 bathrooms.
The median cost of properties is $213,800, however there are also outlier above $750,000 with a max of $884,800.
The median square meter space of the properties is 121.3 m^2, with the third quartile in 159.80m^2. Interestingly the max area reported is 541.4.
In the square meter area for properties there is a significant amount of properties that report 0 as their area. The same for price.
Some initial questions which this analysis provokes are:
It would be interesting to see if there are any particular properties of Sacramento, Elk Grove and Lincoln, relative to the available data, that can suggest why they are so popular as a sales area.
Additionally it will be interesting to run similar analysis for properties which are considerably outside the normal ditribution of the data. For example, this can include data for properties which are exceptionally pricey or that have large surface area for sale.
Likewise it will be interesting to see if there are any particular patterns for unreported data (eg. they may be from the same area, or for certain listings).
The variables ‘’latitude’ and ‘longitude’ contain geospatial data. This data can be used to create a scatterplot map of the property areas. The goespatial information can be used to visually observe sales patterns in the data. Potentially helping us to generate hypothesis to answer questions raised in the Univariate analysis section.
Leveraging the geospatial data, we tag its color using the city variable
With the above information we can overlay a map to observe how prevalent sales are relative to the city in which they were sold.
Therefore, this data is combined with map images pulled from google maps
## left bottom right top
## -121.59941 38.20255 -120.54989 39.05977
This mapping procedure was derived from teh following url: http://eriqande.github.io/rep-res-web/lectures/making-maps-with-R.html
From this analysis we see that the three largest areas are representative of the following: 1) Sacramento is Californias capital city and therefore a main point of concentration for sales. 2) The next two main areas for sales (i.e. Elk Grove and Lincoln) are in the immediate vicinity of Sacramento (the south and North East, respectively). We also see that the further the areas are from Sacramento the less number of sales transcations occur.
## left bottom right top
## -121.59941 38.20255 -120.54989 39.05977
Bsaed on the map image above it does not seem that the type of property affects the geospatial distribution of the sale. It seems that the type of property is not a useful variable for describing the data. Frpm this point on I will not include the type of property in the analysis.
Knowing that the above variables for beds and baths are set as an integer, I convert them to factor variables and plot them in relation to the size and price of the property sold.
df[,'beds_factor'] <- as.factor(df[,'beds']) #convert data to factor variables
df[,'baths_factor'] <- as.factor(df[,'baths'])
The above grid method layout was derived from the following URL: https://cran.r-project.org/web/packages/egg/vignettes/Ecosystem.html
At this point I replace 0 with NA and remove from the plot
df[, 11:12][df[, 11:12] == 0] <- NA #convert 0 to NA
And then I replot the above graphs excluding the missing variables
The method of NA removal was obtained from the following URL: https://stackoverflow.com/questions/11403104/remove-unused-factor- levels-from-a-ggplot-bar-plot
By converting the number of bedrooms and the number of bathrooms to a factor variable, we unveil a relationship in which in the size per square meter and the price increase relative to the number of bathrooms and bedrooms.
Altough it is intersting to see that this relationship does not exactly hold for properties with 6 or more bedrooms or 5 bathrooms. Other aspects like location may have influenced this outcome for price. Though it is not clear why th size of location would decrease relative to the number of rooms.
Given the relationship of bedroom and bathroom with price and size, plot thevariable beds and baths relative to the geospatial data.
## left bottom right top
## -121.59941 38.20255 -120.54989 39.05977
In general it is difficult to discern any patterns, other than areas further away from Sacramento (South, South-East, North and North-East) have a higher number of bedrooms relative to Sacramento.
## left bottom right top
## -121.59941 38.20255 -120.54989 39.05977
Like number of bedrooms, there is a distribution which seems to lead to increased number of bathrooms the furhter away one mooves from Sacramento.
## left bottom right top
## -121.59941 38.20255 -120.54989 39.05977
The above map suggests that tmore expensive areas are found outside of the Sacramento area, and more towards the north east around Granite Bay, Folsom, El Dorado Hills, and Cameron Park. While a couple of more expensive sales also occured in the bottom right corner of the map around the Wilton area.
## left bottom right top
## -121.59941 38.20255 -120.54989 39.05977
While not as prominent a feature, as one goes out of Sacramento area the square meter area per property tends to increase.
Prior analysis suggests that there is a linear relationship between size and property. Therefore to study this relationship I perform a scatterplot.
In general there is a visual relationship betwen price and size. However this relationship seems to hold until around $375,000.
To confirm this observation I plot a linera regression and lowess plot, on top of the scatterplot.
From here, I zoom in on to the data and then readjust the plot to redraw the regression lines as if the data after $375,000 was not considered
We see that when we consider the parsed data alone that there is a strong linear relationship between size and price.
At this point I hypothesize that the relationship between price and size is relative to the properties location. Similarly, as number of bedrooms and bath rooms relate to size, I plotted a similar scatter Plot
We see that most of the pricing data is determined by Sacramento, and that anything under or equall to the median number of bedrooms and bathrooms is generally agregated within the area represented by Sacramento.
Now that I have visually analyzed the data, I will run a few univariate linear regression models separately to determine which significant relationships cna be added to a multiple linear regression model.
##
## Call:
## lm(formula = I(price) ~ I(city), data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -272000 -75200 -18028 47264 501264
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 232496 18896 12.304 < 2e-16 ***
## I(city)AUBURN 173394 52091 3.329 0.000906 ***
## I(city)CAMERON PARK 35448 40819 0.868 0.385383
## I(city)CARMICHAEL 63188 30760 2.054 0.040224 *
## I(city)CITRUS HEIGHTS -45382 26338 -1.723 0.085204 .
## I(city)COOL 67504 110179 0.613 0.540239
## I(city)DIAMOND SPRINGS -16463 110179 -0.149 0.881250
## I(city)EL DORADO 14504 79045 0.183 0.854457
## I(city)EL DORADO HILLS 259203 29484 8.791 < 2e-16 ***
## I(city)ELK GROVE 38661 21457 1.802 0.071891 .
## I(city)ELVERTA -99630 57468 -1.734 0.083305 .
## I(city)FAIR OAKS 71004 40819 1.739 0.082273 .
## I(city)FOLSOM 182464 32406 5.631 2.37e-08 ***
## I(city)FORESTHILL -37678 110179 -0.342 0.732445
## I(city)GALT 4447 30300 0.147 0.883348
## I(city)GARDEN VALLEY 257504 110179 2.337 0.019639 *
## I(city)GOLD RIVER 125504 57468 2.184 0.029217 *
## I(city)GRANITE BAY 446237 65456 6.817 1.65e-11 ***
## I(city)GREENWOOD 162504 110179 1.475 0.140569
## I(city)LINCOLN -135957 22819 -5.958 3.59e-09 ***
## I(city)LOOMIS 334504 79045 4.232 2.54e-05 ***
## I(city)MATHER 5304 110179 0.048 0.961618
## I(city)MEADOW VISTA -2496 110179 -0.023 0.981928
## I(city)NORTH HIGHLANDS -96837 30300 -3.196 0.001440 **
## I(city)ORANGEVALE 46663 37791 1.235 0.217223
## I(city)PENRYN 274192 110179 2.489 0.012995 *
## I(city)PLACERVILLE 131367 39183 3.353 0.000832 ***
## I(city)POLLOCK PINES 7806 65456 0.119 0.905095
## I(city)RANCHO CORDOVA 30910 27890 1.108 0.268023
## I(city)RANCHO MURIETA 65254 65456 0.997 0.319063
## I(city)RIO LINDA -59769 35544 -1.682 0.092987 .
## I(city)ROCKLIN 149339 32406 4.608 4.61e-06 ***
## I(city)ROSEVILLE 92032 24546 3.749 0.000188 ***
## I(city)SACRAMENTO -34760 19593 -1.774 0.076361 .
## I(city)SHINGLE SPRINGS 42504 110179 0.386 0.699754
## I(city)SLOUGHHOUSE -230496 110179 -2.092 0.036702 *
## I(city)WALNUT GROVE 147504 110179 1.339 0.180968
## I(city)WEST SACRAMENTO -61796 65456 -0.944 0.345363
## I(city)WILTON 385012 52091 7.391 3.20e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 108500 on 946 degrees of freedom
## Multiple R-squared: 0.4083, Adjusted R-squared: 0.3846
## F-statistic: 17.18 on 38 and 946 DF, p-value: < 2.2e-16
##
## Calls:
## m1: lm(formula = I(price) ~ I(city), data = df)
##
## ============================================================
## (Intercept) 232496.394***
## (18895.476)
## I(city): AUBURN/ANTELOPE 173394.406***
## (52091.231)
## I(city): CAMERON PARK/ANTELOPE 35448.051
## (40818.893)
## I(city): CARMICHAEL/ANTELOPE 63188.356*
## (30759.606)
## I(city): CITRUS HEIGHTS/ANTELOPE -45381.480
## (26337.725)
## I(city): COOL/ANTELOPE 67503.606
## (110178.609)
## I(city): DIAMOND SPRINGS/ANTELOPE -16463.394
## (110178.609)
## I(city): EL DORADO/ANTELOPE 14503.606
## (79045.446)
## I(city): EL DORADO/ANTELOPE HILLS 259202.563***
## (29484.103)
## I(city): ELK GROVE/ANTELOPE 38661.299
## (21456.756)
## I(city): ELVERTA/ANTELOPE -99630.394
## (57468.346)
## I(city): FAIR OAKS/ANTELOPE 71004.273
## (40818.893)
## I(city): FOLSOM/ANTELOPE 182463.783***
## (32405.473)
## I(city): FORESTHILL/ANTELOPE -37678.394
## (110178.609)
## I(city): GALT/ANTELOPE 4447.035
## (30300.170)
## I(city): GARDEN VALLEY/ANTELOPE 257503.606*
## (110178.609)
## I(city): GOLD RIVER/ANTELOPE 125503.606*
## (57468.346)
## I(city): GRANITE BAY/ANTELOPE 446236.939***
## (65455.848)
## I(city): GREENWOOD/ANTELOPE 162503.606
## (110178.609)
## I(city): LINCOLN/ANTELOPE -135957.130***
## (22818.455)
## I(city): LOOMIS/ANTELOPE 334503.606***
## (79045.446)
## I(city): MATHER/ANTELOPE 5303.606
## (110178.609)
## I(city): MEADOW VISTA/ANTELOPE -2496.394
## (110178.609)
## I(city): NORTH HIGHLANDS/ANTELOPE -96837.061**
## (30300.170)
## I(city): ORANGEVALE/ANTELOPE 46663.152
## (37790.951)
## I(city): PENRYN/ANTELOPE 274191.606*
## (110178.609)
## I(city): PLACERVILLE/ANTELOPE 131367.006***
## (39182.492)
## I(city): POLLOCK PINES/ANTELOPE 7806.273
## (65455.848)
## I(city): RANCHO CORDOVA/ANTELOPE 30909.535
## (27889.693)
## I(city): RANCHO MURIETA/ANTELOPE 65253.606
## (65455.848)
## I(city): RIO LINDA/ANTELOPE -59768.779
## (35543.899)
## I(city): ROCKLIN/ANTELOPE 149339.430***
## (32405.473)
## I(city): ROSEVILLE/ANTELOPE 92031.856***
## (24545.943)
## I(city): SACRAMENTO/ANTELOPE -34760.405
## (19592.803)
## I(city): SHINGLE SPRINGS/ANTELOPE 42503.606
## (110178.609)
## I(city): SLOUGHHOUSE/ANTELOPE -230496.394*
## (110178.609)
## I(city): WALNUT GROVE/ANTELOPE 147503.606
## (110178.609)
## I(city): WEST SACRAMENTO/ANTELOPE -61796.394
## (65455.848)
## I(city): WILTON/ANTELOPE 385012.006***
## (52091.231)
## ------------------------------------------------------------
## R-squared 0.408
## adj. R-squared 0.385
## sigma 108546.243
## F 17.182
## p 0.000
## Log-likelihood -12798.766
## Deviance 11146043479543.883
## AIC 25677.531
## BIC 25873.237
## N 985
## ============================================================
##
## Call:
## lm(formula = I(price) ~ I(beds_factor), data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -278363 -75613 -22260 46740 601530
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 109646 36503 3.004 0.00274 **
## I(beds_factor)2 55782 37850 1.474 0.14091
## I(beds_factor)3 105967 36942 2.868 0.00422 **
## I(beds_factor)4 173614 37203 4.667 3.54e-06 ***
## I(beds_factor)5 289895 39475 7.344 4.78e-13 ***
## I(beds_factor)6 320321 75986 4.215 2.75e-05 ***
## I(beds_factor)8 172754 121066 1.427 0.15396
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 115400 on 870 degrees of freedom
## (108 observations deleted due to missingness)
## Multiple R-squared: 0.219, Adjusted R-squared: 0.2136
## F-statistic: 40.65 on 6 and 870 DF, p-value: < 2.2e-16
##
## Calls:
## m1: lm(formula = I(price) ~ I(beds_factor), data = df)
##
## ==============================================
## (Intercept) 109646.200**
## (36502.668)
## I(beds_factor): 2/1 55782.101
## (37850.081)
## I(beds_factor): 3/1 105966.715**
## (36941.946)
## I(beds_factor): 4/1 173613.893***
## (37203.359)
## I(beds_factor): 5/1 289895.342***
## (39475.092)
## I(beds_factor): 6/1 320320.467***
## (75986.363)
## I(beds_factor): 8/1 172753.800
## (121065.654)
## ----------------------------------------------
## R-squared 0.219
## adj. R-squared 0.214
## sigma 115431.572
## F 40.649
## p 0.000
## Log-likelihood -11463.587
## Deviance 11592269584516.680
## AIC 22943.174
## BIC 22981.386
## N 877
## ==============================================
##
## Call:
## lm(formula = I(price) ~ I(baths_factor), data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -378801 -70352 -20599 52475 504438
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 159025 8047 19.762 < 2e-16 ***
## I(baths_factor)2 67474 9283 7.268 8.09e-13 ***
## I(baths_factor)3 221328 12372 17.890 < 2e-16 ***
## I(baths_factor)4 199741 26043 7.670 4.60e-14 ***
## I(baths_factor)5 595975 76764 7.764 2.31e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 108000 on 872 degrees of freedom
## (108 observations deleted due to missingness)
## Multiple R-squared: 0.3152, Adjusted R-squared: 0.3121
## F-statistic: 100.3 on 4 and 872 DF, p-value: < 2.2e-16
##
## Calls:
## m1: lm(formula = I(price) ~ I(baths_factor), data = df)
##
## ===============================================
## (Intercept) 159024.667***
## (8047.057)
## I(baths_factor): 2/1 67474.309***
## (9283.396)
## I(baths_factor): 3/1 221327.477***
## (12371.643)
## I(baths_factor): 4/1 199741.070***
## (26042.747)
## I(baths_factor): 5/1 595975.333***
## (76764.028)
## -----------------------------------------------
## R-squared 0.315
## adj. R-squared 0.312
## sigma 107962.594
## F 100.337
## p 0.000
## Log-likelihood -11405.929
## Deviance 10163963662608.637
## AIC 22823.857
## BIC 22852.516
## N 877
## ===============================================
From the initial univariate analysis, we see that city relates best to the transaction listings, followed by number of bedrooms, then bathrooms, and finally square meters.
Using this insight the multivariate model will be created accordingly.
##
## Calls:
## m1: lm(formula = I(price) ~ I(city), data = df)
## m2: lm(formula = I(price) ~ I(city) + beds_factor, data = df)
## m3: lm(formula = I(price) ~ I(city) + beds_factor + baths_factor,
## data = df)
## m4: lm(formula = I(price) ~ I(city) + beds_factor + baths_factor +
## sq_m, data = df)
##
## ==============================================================================================================================
## m1 m2 m3 m4
## ------------------------------------------------------------------------------------------------------------------------------
## (Intercept) 232496.394*** 124614.829*** 113118.076*** 68909.091*
## (18895.476) (34347.647) (32183.782) (30906.892)
## I(city): AUBURN/ANTELOPE 173394.406*** 192972.896*** 148385.426*** 199816.315***
## (52091.231) (45437.249) (42741.242) (40938.671)
## I(city): CAMERON PARK/ANTELOPE 35448.051 87090.987* 97113.738** 163656.195***
## (40818.893) (37318.735) (34941.173) (33898.800)
## I(city): CARMICHAEL/ANTELOPE 63188.356* 81857.599** 94526.504*** 91455.726***
## (30759.606) (26799.473) (25121.226) (23855.654)
## I(city): CITRUS HEIGHTS/ANTELOPE -45381.480 -36366.535 -23167.962 -7836.600
## (26337.725) (22925.978) (21549.351) (20524.504)
## I(city): COOL/ANTELOPE 67503.606 90555.327 97534.517 100249.768
## (110178.609) (95891.333) (89706.308) (85179.796)
## I(city): DIAMOND SPRINGS/ANTELOPE -16463.394 6588.327 13567.517 25269.831
## (110178.609) (95891.333) (89706.308) (85188.102)
## I(city): EL DORADO/ANTELOPE 14503.606 59672.417 73434.229 69652.137
## (79045.446) (68878.246) (64480.978) (61228.253)
## I(city): EL DORADO/ANTELOPE HILLS 259202.563*** 208972.941*** 188971.604*** 287260.182***
## (29484.103) (26986.879) (25383.023) (26199.489)
## I(city): ELK GROVE/ANTELOPE 38661.299 24816.562 28178.670 19026.113
## (21456.756) (18861.814) (17651.782) (16788.262)
## I(city): ELVERTA/ANTELOPE -99630.394 -89807.987 -74842.787 -55989.823
## (57468.346) (50019.442) (46824.851) (44505.488)
## I(city): FAIR OAKS/ANTELOPE 71004.273 88435.662* 63439.741 52768.257
## (40818.893) (35568.771) (34292.673) (32581.201)
## I(city): FOLSOM/ANTELOPE 182463.783*** 162608.628*** 157366.106*** 139638.627***
## (32405.473) (29451.061) (27564.277) (26238.742)
## I(city): FORESTHILL/ANTELOPE -37678.394
## (110178.609)
## I(city): GALT/ANTELOPE 4447.035 -6523.542 4224.833 8326.591
## (30300.170) (26783.758) (25137.937) (23873.212)
## I(city): GARDEN VALLEY/ANTELOPE 257503.606* 280555.327** 287534.517** 373651.999***
## (110178.609) (95891.333) (89706.308) (85653.423)
## I(city): GOLD RIVER/ANTELOPE 125503.606* 164994.301** 179094.705*** 140680.788**
## (57468.346) (56996.901) (53363.501) (50829.309)
## I(city): GRANITE BAY/ANTELOPE 446236.939*** 358948.679*** 341118.095*** 311888.570***
## (65455.848) (57486.169) (53917.648) (51287.764)
## I(city): GREENWOOD/ANTELOPE 162503.606 185555.327 192534.517* 115740.023
## (110178.609) (95891.333) (89706.308) (85556.543)
## I(city): LINCOLN/ANTELOPE -135957.130*** 8004.227 7671.676 83588.350***
## (22818.455) (26400.694) (24717.040) (24774.341)
## I(city): LOOMIS/ANTELOPE 334503.606*** 304638.071*** 280971.416*** 323481.452***
## (79045.446) (68961.750) (65533.039) (62384.322)
## I(city): MATHER/ANTELOPE 5303.606 28355.327 35334.517 1643.578
## (110178.609) (95891.333) (89706.308) (85252.057)
## I(city): MEADOW VISTA/ANTELOPE -2496.394 20555.327 51680.160 55436.558
## (110178.609) (95891.333) (90105.359) (85559.138)
## I(city): NORTH HIGHLANDS/ANTELOPE -96837.061** -77132.168** -56249.069* -40658.848
## (30300.170) (26416.622) (24970.623) (23766.402)
## I(city): ORANGEVALE/ANTELOPE 46663.152 49682.863 63230.354* 62987.711*
## (37790.951) (32914.154) (30965.738) (29403.079)
## I(city): PENRYN/ANTELOPE 274191.606* 297243.327** 304222.517*** 314779.982***
## (110178.609) (95891.333) (89706.308) (85186.468)
## I(city): PLACERVILLE/ANTELOPE 131367.006*** 152123.817*** 148994.522*** 189026.637***
## (39182.492) (35528.714) (33242.944) (31841.375)
## I(city): POLLOCK PINES/ANTELOPE 7806.273 45602.720 57103.659 58101.419
## (65455.848) (57016.519) (53363.690) (50670.825)
## I(city): RANCHO CORDOVA/ANTELOPE 30909.535 -9665.643 -8903.569 -8482.638
## (27889.693) (26403.635) (24791.596) (23540.542)
## I(city): RANCHO MURIETA/ANTELOPE 65253.606 135388.071* 124217.867 69674.480
## (65455.848) (68961.750) (64534.055) (61541.907)
## I(city): RIO LINDA/ANTELOPE -59768.779 -22584.087 -4877.443 6359.794
## (35543.899) (31887.974) (29931.531) (28445.302)
## I(city): ROCKLIN/ANTELOPE 149339.430*** 133267.190*** 135829.423*** 174516.705***
## (32405.473) (30947.404) (28949.821) (27784.601)
## I(city): ROSEVILLE/ANTELOPE 92031.856*** 83211.553*** 83998.253*** 110526.580***
## (24545.943) (22217.362) (20801.039) (19944.935)
## I(city): SACRAMENTO/ANTELOPE -34760.405 -24914.927 -11384.512 -6118.953
## (19592.803) (17114.899) (16168.603) (15362.521)
## I(city): SHINGLE SPRINGS/ANTELOPE 42503.606
## (110178.609)
## I(city): SLOUGHHOUSE/ANTELOPE -230496.394* -207444.673* -325645.246*** -521444.198***
## (110178.609) (95891.333) (92914.346) (90567.159)
## I(city): WALNUT GROVE/ANTELOPE 147503.606 170555.327 201680.160* 176185.672*
## (110178.609) (95891.333) (90105.359) (85599.709)
## I(city): WEST SACRAMENTO/ANTELOPE -61796.394 -18210.494 2333.941 10543.814
## (65455.848) (96202.503) (90145.364) (85600.523)
## I(city): WILTON/ANTELOPE 385012.006*** 339685.841*** 353104.416*** 272613.956***
## (52091.231) (50292.768) (47073.435) (45482.474)
## beds_factor: 2/1 40595.665 31547.983 16944.781
## (31077.196) (29297.723) (27861.055)
## beds_factor: 3/1 84829.843** 65201.765* 36047.548
## (30353.000) (29219.995) (27912.199)
## beds_factor: 4/1 137747.100*** 86174.984** 41812.893
## (30559.967) (29972.835) (28835.382)
## beds_factor: 5/1 223881.188*** 107159.498** 44493.010
## (32627.881) (32949.484) (31964.750)
## beds_factor: 6/1 235726.979*** 3610.240 -34646.955
## (62620.331) (64775.715) (61636.641)
## beds_factor: 8/1 182700.098 31341.031 -72371.911
## (99052.713) (96153.312) (91942.041)
## baths_factor: 2/1 24145.642** 11391.362
## (9038.637) (8685.387)
## baths_factor: 3/1 124332.503*** 84097.026***
## (14063.672) (14000.282)
## baths_factor: 4/1 149325.406*** 85222.090***
## (25574.416) (25190.876)
## baths_factor: 5/1 460291.383*** 429697.673***
## (73110.206) (69494.318)
## sq_m 615.510***
## (64.323)
## ------------------------------------------------------------------------------------------------------------------------------
## R-squared 0.408 0.499 0.564 0.607
## adj. R-squared 0.385 0.474 0.540 0.585
## sigma 108546.243 94422.256 88321.792 83864.677
## F 17.182 19.779 23.318 27.261
## p 0.000 0.000 0.000 0.000
## Log-likelihood -12798.766 -11268.866 -11208.183 -11162.241
## Deviance 11146043479543.883 7435579141776.336 6474613369753.001 5830592498359.043
## AIC 25677.531 22625.732 22512.366 22422.482
## BIC 25873.237 22835.898 22741.638 22656.531
## N 985 877 877 877
## ==============================================================================================================================
From the initial univariate regression analysis, we see that the most expensive area is Granite Ville. However upon considering square meter, number of bedrooms and bathrooms Garden Valley ends up having the most expensive proeprties, then Granite Ville.
From the above graph we note that the sales density is largest towards Sacramento and immediately adjascent areas. However, sales prices increase the further out the property is relative to Sacramento.
This is interesting because generally in large cities in the Americas, property prices tend to increaes the closer one is to the main city, for example this is observed in San Francisco, Boston, New York, and even cities outside the USA like Bogota, Colombia.
The pricing patterns are paradoxical to what one would traditionally expect.
Tip: You’ve done a lot of exploration and have built up an understanding of the structure of and relationships between the variables in your dataset. Here, you will select three plots from all of your previous exploration to present here as a summary of some of your most interesting findings. Make sure that you have refined your selected plots for good titling, axis labels (with units), and good aesthetic choices (e.g. color, transparency). After each plot, make sure you justify why you chose each plot by describing what it shows.
What we observe when plotting property prices relative to the location is that Sacramento city plays a huge role in stabilizing pricing property. As the property gets further away from Sacramento, prices begin to become less linearly related to the property’s size.
This plot helps to confirm insight provided by plot one, in which we observe that prices tend to rise the further away one gets from Sacramento.
```
In the third visualization, we note acknowledge there are other factors which can influence the price of a transaction other than location and size, such as the number of bedrooms and bathrooms.
In fact, we also note this in our multi-linear regression model, and point out how these two co-variates impact the cost of the most expensive property in this sample (see multi-linear regression section).
I was interested in analyzing this particular set of data, because it contained an appropriate number of geolocalization information to construct geospatial maps, while still allowing me to apply the different techniques learnt during the exploratory data analysis course. In addition to this, I am currently searching for property to rent and I thought it woudl be useful to leverage this opportunity to gain insight into a real life scenario.
A lot of the coding was straight forward, however I did have to try multiple different tools/techniques before being able to graph the map relative to the available data. I woudl say anything relating map data to the current data was a bit of a challenge, however geospatial maps were acomplished. Another challenge I ran into was with parsing the date-time data. In the end, I dropped the data, however in my initial attempt, I was able to separate the day from the rest of the date-time data. Ulteimately the data was not useful so I dropped it.
I learnt to transform values in a column like 0 into NaN or NA, and recently learnt that NA is a non-numeric value, while NaN is a numeric value.
Ultimately while I see the power of R to compute big data in mere seconds, and perform complex visualizations. I still prefer python. I’m also quite satisfied with the linear regression model generation. It was suprising to me that location was more indicative of pricing when compared to over variables like size. I guess the old saying “location, location, location” really makes an impact afterall. In the future this data, can be combined with a yearly account to really try and grasp deeper tendencies (afterall this data only represents sales in a 5 day period). Additionally, other variables like family size would be interesting to integrate. Finally I would recommend that the creators of the dataset reorganize the variable type, because at present it is not very informative.